Perform an analysis of “data scientist” jobs listed on job boards and on the employment pages of major companies. What are the most common skills that employers look for? What are the most unique skills that employers look for? What types of companies employ the most data scientists?

1. Scrape Data from Job Search Boards

I have begun scraping data from Stack Overflow by searching for jobs with the term “Data Scientist.” Here’s some of the data I’ve been able to extract:

| position | company | location | job_type | experience | industry | company_size | company_type | link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Data Analyst | Jane.com | Lehi, UT | Permanent | Junior, Mid-Level | Retail Industry | 51-200 people | Private | https://stackoverflow.com/jobs/155681/data-analyst-janecom |
| Data Scientist | National Security Agency | Fort Meade, MD | Permanent | Mid-Level | Cybersecurity, Federal Agencies, Signals Analysis | 10k+ people | Public | https://stackoverflow.com/jobs/155241/data-scientist-national-security-agency |
| Data Scientist | Grindr | West Hollywood, CA | Permanent | Mid-Level | Big Data, Social Media, Software Development | 51-200 people | Private | https://stackoverflow.com/jobs/151199/data-scientist-grindr |
| Data Scientist | StreamDetroit | Ferndale, MI | Contract | Junior | Content Marketing, Digital Marketing, SaaS | 11-50 people | Private | https://stackoverflow.com/jobs/154094/data-scientist-streamdetroit |
| Data Scientist | BuzzFeed | Los Angeles, CA | Permanent | Mid-Level, Senior, Lead | Digital Media, Entertainment, News | 1k-5k people | VC Funded | https://stackoverflow.com/jobs/153762/data-scientist-buzzfeed |
| Data Scientist | PlentyOfFish Media ULC | Vancouver, BC, Canada | Permanent | Senior | Information Technology | 51-200 people | Public | https://stackoverflow.com/jobs/152134/data-scientist-plentyoffish-media-ulc |
| Lead Data Scientist | Careem | Berlin, Germany | Permanent | Senior, Lead, Manager | Software Development / Engineering, Transportation | 1k-5k people | VC Funded | https://stackoverflow.com/jobs/148268/lead-data-scientist-careem |
| Data Scientist / Engineer | SemanticBits | No office location | Permanent | NA | Digital Health | 51-200 people | NA | https://stackoverflow.com/jobs/141227/data-scientist-engineer-semanticbits |
| SQL Data Analyst | Detroit Trading Company | Birmingham, MI | Permanent | NA | Automotive | 51-200 people | Private | https://stackoverflow.com/jobs/151026/sql-data-analyst-detroit-trading-company |
| Business (Data) Analyst | Shane Co. | Centennial, CO | Permanent | Mid-Level | Consumer Products, Jewelry, Luxury Goods | 501-1k people | Private | https://stackoverflow.com/jobs/155412/business-data-analyst-shane-co |

There are 167 unique jobs listed on Stack Overflow that come up in my latest search.

In addition to the manual scraping I have performed, I have also received a database of jobs posted on Stack Overflow over the years from Dave Robinson, a Data Scientist at Stack Overflow. Per his request, the data will be kept private.

2. Visualize Job Postings by Location

Here, I visualize the Stack Overflow jobs using data provided by Dave Robinson. Markers are color coded by the year in which the job was posted. I use the geocode function from the ggmap package to obtain latitude and longitude coordinates for the cities. I use code from this repository to jitter the latitude and longitude, so that multiple points per city can be seen. The map is interactive: select data points by year to view the geographic trends in job listings!
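
To illustrate the jittering step, here is a minimal sketch in Python (not the R code from the repository mentioned above, which I am not reproducing). The function name `jitter_coordinates`, the `amount` of 0.05 degrees, and the example coordinates are all assumptions chosen for illustration.

```python
import random


def jitter_coordinates(points, amount=0.05, seed=None):
    """Return a copy of (lat, lon) pairs with uniform noise added.

    Each coordinate is shifted by a value drawn from [-amount, amount]
    degrees, so postings geocoded to the same city no longer overlap.
    """
    rng = random.Random(seed)
    return [
        (lat + rng.uniform(-amount, amount), lon + rng.uniform(-amount, amount))
        for lat, lon in points
    ]


# Three postings geocoded to the same (approximate, illustrative) city centre.
same_city = [(34.09, -118.36)] * 3
jittered = jitter_coordinates(same_city, amount=0.05, seed=42)
```

A larger `amount` spreads the markers further apart but makes each one a less accurate indication of the city, so the value would need tuning to the zoom level of the map.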

Here is a plot (yes, I figured out how to use ggplot…) that shows trends in number of jobs posted by US region over the past several years.

3. Skills that Employers are Looking For

Here is a word cloud of technical skill tags from the Stack Overflow job listings (data dump, not manual scraping):
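
A word cloud is driven by tag frequencies, so the counting behind it can be sketched as follows. This is a Python illustration under assumptions, not the code used to produce the figure: the tag lists are made up, and `count_tags` is a hypothetical helper.

```python
from collections import Counter


def count_tags(postings):
    """Count how many postings mention each tag (case-insensitive)."""
    counts = Counter()
    for tags in postings:
        # Use a set so a tag repeated within one posting is counted once.
        counts.update({tag.strip().lower() for tag in tags})
    return counts


# Made-up tag lists, one per job posting.
postings = [
    ["python", "r", "sql"],
    ["Python", "machine-learning"],
    ["r", "sql", "python"],
]
tag_counts = count_tags(postings)
most_common = tag_counts.most_common(1)
rare_tags = sorted(tag for tag, n in tag_counts.items() if n == 1)
```

The same counts speak to both opening questions: the top of the ranking gives the most common skills, and tags that appear in only one posting are candidates for the most unique ones.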

More EDA:

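Months with the fewest job postings by year

| month | year | n | month_yr |
| --- | --- | --- | --- |
| 08 | 2010 | 1 | 08-2010 |
| 04 | 2011 | 1 | 04-2011 |
| 01 | 2012 | 1 | 01-2012 |
| 08 | 2013 | 1 | 08-2013 |
| 04 | 2014 | 10 | 04-2014 |
| 08 | 2015 | 10 | 08-2015 |
| 06 | 2016 | 10 | 06-2016 |
| 02 | 2017 | 18 | 02-2017 |

Months with the most job postings by year

| month | year | n | month_yr |
| --- | --- | --- | --- |
| 10 | 2010 | 3 | 10-2010 |
| 02 | 2011 | 4 | 02-2011 |
| 08 | 2012 | 5 | 08-2012 |
| 06 | 2013 | 11 | 06-2013 |
| 06 | 2014 | 22 | 06-2014 |
| 07 | 2015 | 24 | 07-2015 |
| 02 | 2016 | 35 | 02-2016 |
| 09 | 2017 | 46 | 09-2017 |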
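
Summaries like the two above can be derived by counting postings per month and then taking the minimum and maximum within each year. The Python sketch below shows one way to do that; it assumes posting dates are available as `YYYY-MM-DD` strings, and both the `monthly_extremes` helper and the sample dates are hypothetical.

```python
from collections import Counter


def monthly_extremes(dates):
    """Map each year to ((fewest_month, n), (most_month, n)).

    `dates` is an iterable of 'YYYY-MM-DD' strings, one per posting.
    Ties are broken in favour of the earliest month.
    """
    counts = Counter((d[:4], d[5:7]) for d in dates)
    by_year = {}
    for (year, month), n in counts.items():
        by_year.setdefault(year, []).append((month, n))
    result = {}
    for year, months in by_year.items():
        months.sort()  # order by month so min/max return the earliest tie
        fewest = min(months, key=lambda item: item[1])
        most = max(months, key=lambda item: item[1])
        result[year] = (fewest, most)
    return result


# Made-up posting dates.
dates = [
    "2016-02-01", "2016-02-15", "2016-02-20", "2016-06-03",
    "2017-09-01", "2017-09-12", "2017-02-07",
]
extremes = monthly_extremes(dates)
```

Note that a month with no postings at all never appears in the counts, so with this approach the "fewest" month is always one with at least one posting.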

10 most common majors listed in job postings

| Major | n |
| --- | --- |
| Computer Science | 469 |
| Statistics | 436 |
| Engineering | 301 |
| Mathematics | 211 |
| Physics | 116 |
| Intelligence | 89 |
| Applied Mathematics | 75 |
| Operations Research | 52 |
| Artificial Intelligence | 26 |
| Bioinformatics | 20 |
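
Several of these majors overlap as strings ("Mathematics" inside "Applied Mathematics", "Intelligence" inside "Artificial Intelligence"), so the counts depend on how the matching is done. The Python sketch below shows one possible approach, which may differ from how the table above was produced: each posting is scanned once with a single pattern, so an overlapping shorter name is not counted a second time. The sample descriptions and the `count_majors` helper are hypothetical.

```python
import re
from collections import Counter

MAJORS = [
    "Computer Science", "Statistics", "Engineering", "Mathematics",
    "Physics", "Intelligence", "Applied Mathematics",
    "Operations Research", "Artificial Intelligence", "Bioinformatics",
]

# Longest names first, so that if two names start at the same position
# the longer one wins. Matches are non-overlapping, so "Mathematics" is
# not matched again inside an already matched "Applied Mathematics".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in sorted(MAJORS, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)
_CANONICAL = {m.lower(): m for m in MAJORS}


def count_majors(descriptions):
    """Count the number of postings that mention each major at least once."""
    counts = Counter()
    for text in descriptions:
        found = {_CANONICAL[match.lower()] for match in _PATTERN.findall(text)}
        counts.update(found)
    return counts


# Made-up job description snippets.
descriptions = [
    "Degree in Computer Science, Statistics, or Applied Mathematics.",
    "MS in statistics or artificial intelligence preferred.",
    "Background in mathematics or physics.",
]
major_counts = count_majors(descriptions)
```

With plain substring matching instead, the second description would also be counted under "Intelligence", which is worth keeping in mind when reading the counts.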

Things to do next:

Analysis Plan:

  • Look at trends in programming over time (while jobs only last about 4 weeks on Stack Overflow, I now have access to 8 years of postings…)
    • consider limitation: hiring season/scraping bias
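    • undersampling/length bias: postings for jobs in higher demand may be filled and taken down sooner, so a scrape is less likely to capture them
  • PCA and clustering of skills by sector
  • Are public companies more likely to post more jobs than private companies? Do bigger companies post more?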
  • Compare job postings from two different job boards
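
For the question about company type, a raw count of postings per type would mix up how many companies of that type there are with how often each one posts. A rough Python sketch of one alternative is to average postings per company within each type. The rows and the `postings_per_company_by_type` helper below are hypothetical and only illustrate the idea.

```python
from collections import Counter, defaultdict


def postings_per_company_by_type(rows):
    """Return {company_type: mean number of postings per company}.

    `rows` is an iterable of (company, company_type) pairs, one per posting.
    """
    per_company = Counter(rows)
    totals = defaultdict(lambda: [0, 0])  # type -> [postings, companies]
    for (_company, company_type), n in per_company.items():
        totals[company_type][0] += n
        totals[company_type][1] += 1
    return {
        company_type: postings / companies
        for company_type, (postings, companies) in totals.items()
    }


# Made-up postings: company A posts three times, the others once each.
rows = [
    ("A", "Public"), ("A", "Public"), ("A", "Public"), ("B", "Public"),
    ("C", "Private"), ("D", "Private"),
]
rates = postings_per_company_by_type(rows)
```

The same grouping could be applied to company_size instead of company_type to look at whether bigger companies post more.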

Limitations:

  • Reproducibility issue: data are unique to when they are scraped, and I’m obtaining data from a data dump
    • list when data were pulled in report, save csv
    • one option: scrape data every so often, saving csv files, take union of rows to maximize number of postings
  • May not be generalizable: looking only at Stack Overflow is a potential source of bias, since it could attract job postings from certain types of industries
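
The "scrape every so often and take the union of rows" option above can be sketched in Python as follows. This is only an illustration under assumptions: the snapshots are in-memory stand-ins for saved csv files, the `link` column is assumed to identify a posting uniquely, and `union_snapshots` is a hypothetical helper.

```python
import csv
import io


def union_snapshots(snapshots, key="link"):
    """Merge csv snapshots, keeping one row per unique value of `key`.

    `snapshots` is an iterable of open text files (or file-like objects).
    When a posting appears in several snapshots, the first copy is kept.
    """
    merged = {}
    for snapshot in snapshots:
        for row in csv.DictReader(snapshot):
            merged.setdefault(row[key], row)
    return list(merged.values())


# Two made-up snapshots with one posting in common.
march = io.StringIO(
    "position,company,link\n"
    "Data Scientist,Acme,https://example.com/jobs/1\n"
    "Data Analyst,Beta,https://example.com/jobs/2\n"
)
april = io.StringIO(
    "position,company,link\n"
    "Data Analyst,Beta,https://example.com/jobs/2\n"
    "Lead Data Scientist,Gamma,https://example.com/jobs/3\n"
)
all_rows = union_snapshots([march, april])
```

With real files, each snapshot would be opened with `open(path, newline="")`, and recording the scrape date alongside each file would cover the "list when data were pulled" point.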